R is great for data visualization, and ggplot2 is one of its most popular visualization packages. Hadley Wickham created ggplot2 in 2005 as an implementation of Leland Wilkinson’s Grammar of Graphics. It divides graphs into semantic components like scales and layers.
After reading this post, you’ll be able to create beautiful scatter
plots like the one below.
Install the package using install.packages("ggplot2").
If you’ve already installed it on your computer, load it -
library("ggplot2")
Load the data set and convert the text columns to factors to make it usable -
college <- read.csv('Data/college.csv', stringsAsFactors = TRUE)
You can get the data set here
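If you do not have the CSV at hand, a small stand-in data frame lets you run the examples in this post. The sketch below is only an assumption about the shape of the data: the column names (tuition, sat_avg, control, undergrads) come from the code in this post, but every value is simulated and the level names "Private" and "Public" are guesses.

```r
# Simulated stand-in for college.csv; all values are made up for illustration.
set.seed(42)
n <- 200
college_demo <- data.frame(
  tuition    = round(runif(n, 5000, 55000)),
  control    = factor(sample(c("Private", "Public"), n, replace = TRUE)),
  undergrads = round(runif(n, 500, 40000))
)
# Give the SAT averages a loose positive relationship with tuition.
college_demo$sat_avg <- round(900 + 0.006 * college_demo$tuition + rnorm(n, sd = 80))

str(college_demo)
```

To follow along with the stand-in, assign it to the name used in the post: college <- college_demo.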
Calling ggplot() function alone just creates a blank
canvas -
ggplot()
To create a scatter plot, add a point layer to the ggplot object.
One way is to call layer() with the argument
geom = 'point' -
ggplot(data = college) +
layer(geom = 'point', stat = "identity", position = "identity",
mapping = aes(x = tuition, y = sat_avg))
But the easier and more widely used way of adding a layer is using a
geom_* function -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg))
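To check that the two calls build the same kind of layer, you can inspect the plot objects. This is a minimal sketch on a hypothetical three-row data frame named demo (not the college data); it only looks at which geom each plot's single layer uses.

```r
library(ggplot2)

# Hypothetical toy data standing in for the college data set.
demo <- data.frame(tuition = c(10000, 25000, 40000),
                   sat_avg = c(950, 1100, 1300))

p_layer <- ggplot(data = demo) +
  layer(geom = "point", stat = "identity", position = "identity",
        mapping = aes(x = tuition, y = sat_avg))

p_geom <- ggplot(data = demo) +
  geom_point(mapping = aes(x = tuition, y = sat_avg))

# Each plot holds one layer; print the class of the geom it uses.
class(p_layer$layers[[1]]$geom)[1]
class(p_geom$layers[[1]]$geom)[1]
```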
You can change the shape of the points from the default black dot to something else. For example -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg),
shape = 1)
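The shape argument accepts the integer codes 0 to 25. To see what each code looks like, you can draw them all at once. The sketch below uses a throwaway data frame named shape_codes (a name chosen here, not part of the college data) and scale_shape_identity() so that the numbers are used as shape codes directly.

```r
library(ggplot2)

# One row per shape code.
shape_codes <- data.frame(shape = 0:25)

p_shapes <- ggplot(shape_codes, aes(x = 0, y = 0, shape = shape)) +
  geom_point(size = 5, fill = "red") +  # fill is only visible on shapes 21 to 25
  scale_shape_identity() +              # treat the numbers as shape codes
  facet_wrap(~shape) +                  # one small panel per code
  theme_void()
p_shapes
```

Shapes 21 to 25 have both an outline (color) and an interior (fill), which is why only those panels show red.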
You can use different shapes for different values/levels. For example,
in our college data there is a column named
control that records whether a school is
public or private.
So if you want to differentiate public vs. private schools by shape,
you can do that using the shape argument inside the
aesthetic mapping -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg, shape = control))
You can change the color of the points from the default black to something else. For example -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg),
color = 'darkorchid1')
You can list all the built-in color names by running the code
colors()
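The full list is long, so it often helps to search it. A small sketch using only base R; the search term "orchid" is just an example, and orchid_names is a name chosen here for illustration.

```r
# colors() comes with base R (grDevices), so no extra package is needed.
length(colors())

# Keep only the built-in names that contain "orchid".
orchid_names <- grep("orchid", colors(), value = TRUE)
orchid_names

# Check that a specific name is valid before using it in a plot.
"darkorchid1" %in% colors()
```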
Similar to changing shape based on the levels of a variable, you can also change color. For this you need to pass the color argument inside the aesthetic mapping specifying the variable name -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg, color = control))
Now you can clearly see how the private and public schools are performing.
You can assign colors of your choice to the plot using the function
scale_color_manual() -
manu_colors <- c("#FF8C32", "#06113C")
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg, color = control))+
scale_color_manual(values = manu_colors)
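With an unnamed vector the colors are matched to the factor levels in order. Naming the vector makes the pairing explicit. The sketch below runs on a hypothetical four-row data frame named demo and assumes the levels of control are called "Private" and "Public", which may differ in your copy of the data.

```r
library(ggplot2)

# Hypothetical toy data; the level names are an assumption.
demo <- data.frame(
  tuition = c(8000, 12000, 35000, 48000),
  sat_avg = c(980, 1050, 1200, 1350),
  control = factor(c("Public", "Public", "Private", "Private"))
)

# Names must match the factor levels exactly.
named_colors <- c(Private = "#FF8C32", Public = "#06113C")

p_named <- ggplot(data = demo) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, color = control)) +
  scale_color_manual(values = named_colors)
p_named
```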
You can hide the legend using the argument show.legend
outside of the aesthetic mapping -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg, color = control),
show.legend = FALSE)+
scale_color_manual(values = manu_colors)
colourpicker Addin for choosing color
View this link for details on how to install and use it.
CPCOLS <- c("#8B0A50", "#9A32CD")
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg, color = control))+
scale_color_manual(values=CPCOLS)
You can change the size of the points from regular size to something else. For example -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg), size = 2)
Let’s vary the size of the points according to the number of undergraduates at each school -
ggplot(data = college) +
geom_point(aes(x = tuition, y = sat_avg, size = undergrads))
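When size is mapped to a variable, the smallest and largest point sizes can be controlled with the range argument of scale_size(). A minimal sketch on a hypothetical four-row data frame named demo; the range c(1, 8) is an arbitrary choice for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for the college data set.
demo <- data.frame(
  tuition    = c(8000, 12000, 35000, 48000),
  sat_avg    = c(980, 1050, 1200, 1350),
  undergrads = c(30000, 15000, 4000, 1200)
)

p_size <- ggplot(data = demo) +
  geom_point(aes(x = tuition, y = sat_avg, size = undergrads)) +
  scale_size(range = c(1, 8))  # smallest and largest point size
p_size
```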
The transparency of the points can be controlled using the argument
alpha outside of the aesthetic mapping -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg, size = undergrads),
alpha = 0.35)
alpha takes values from 0 to 1.
Notice how the transparency of the points in the legend also changes. To
remove the transparency from the legend, we can use the guides() function to
override the aesthetics of the legend keys -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
size = undergrads),
alpha = 0.35) +
guides(size = guide_legend(override.aes = list(alpha = 1)))
Add a title and subtitle using ggtitle() -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
ggtitle("SAT Average score VS Tuition Fee",
subtitle = "A comparison study")
I prefer using labs() because it leaves more room for
customization, for example changing the legend titles -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study")
To align the title and subtitle in the middle you can customize the theme manually -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study",
color = "Control", size = "No. of Undergrads") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
By default the title and subtitle are positioned relative to the panel, so the
alignment is computed from the panel. If you want to align them
relative to the whole plot, then you have to specify it using the
plot.title.position argument inside the function
theme() -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study",
color = "Control", size = "No. of Undergrads") +
theme(plot.title.position = "plot",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Like ggtitle(), there are functions called
xlab() and ylab() that can be used to change
axis labels -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
xlab("Tuition Fees") +
ylab("SAT Average Score")
But the labs() function with arguments x
and y may seem more convenient -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study",
color = "Control", size = "No. of Undergrads",
x = "Tuition Fees", y = "SAT Average Score")
Using xlim and ylim you can specify the
limits -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study",
color = "Control", size = "No. of Undergrads",
x = "Tuition Fees", y = "SAT Average Score") +
xlim(0, 60000) + ylim(700, 1500)
When restricting the axes this way, points that fall outside the limits are removed from the plot.
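You can count how many points a set of limits drops by inspecting the built plot with layer_data(). The sketch below uses a hypothetical four-row data frame named demo in which one tuition value lies above 60000. It also shows coord_cartesian(), which is not used elsewhere in this post, as one way to zoom without discarding data.

```r
library(ggplot2)

# Hypothetical toy data; the last tuition value is above the upper limit.
demo <- data.frame(tuition = c(5000, 20000, 40000, 70000),
                   sat_avg = c(900, 1000, 1200, 1400))

p_base <- ggplot(data = demo) +
  geom_point(mapping = aes(x = tuition, y = sat_avg))

# Scale limits: values outside the range are replaced by NA.
p_lim <- p_base + xlim(0, 60000)

# Coordinate limits: the view is zoomed, the data are left as they are.
p_zoom <- p_base + coord_cartesian(xlim = c(0, 60000))

# Count the x values that became NA under each approach.
dropped_lim  <- sum(is.na(layer_data(p_lim)$x))
dropped_zoom <- sum(is.na(layer_data(p_zoom)$x))
c(xlim = dropped_lim, coord_cartesian = dropped_zoom)
```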
The expand_limits() function also adjusts the limits, but it
never removes points: it only widens the axes so that they include the
values you supply along with all the data -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study",
color = "Control", size = "No. of Undergrads",
x = "Tuition Fees", y = "SAT Average Score") +
expand_limits(x = c(0, 60000), y = c(700, 1500))
The scale_x_continuous() and scale_y_continuous() functions can be used to manually set the axis names, breaks, limits, labels and more. For example -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study",
color = "Control", size = "No. of Undergrads") +
scale_x_continuous(name = "Tuition Fees",
limits = c(0, 56000), # to change limit
breaks = seq(0, 56000, by = 8000), # to specify breaks
labels = scales::dollar
) +
scale_y_continuous(name = "SAT Average Score")
To know more run ?scale_x_continuous.
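To preview what the breaks and labels arguments will produce, you can evaluate them on their own before adding them to a plot. A small sketch reusing the values from the example above; axis_breaks and axis_labels are names chosen here for illustration.

```r
# Breaks every 8000 between 0 and 56000.
axis_breaks <- seq(0, 56000, by = 8000)
axis_breaks

# scales::dollar() turns numbers into dollar-formatted strings.
axis_labels <- scales::dollar(axis_breaks)
axis_labels
```

Passing labels = scales::dollar to the scale applies the same formatting to whatever breaks the scale ends up using.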
The caption appears in the bottom-right, and is often used for sources, notes or copyright -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study",
color = "Control", size = "No. of Undergrads",
x = "Tuition Fees", y = "SAT Average Score",
caption = "Source: U.S. Department of Education") +
theme(plot.caption.position = "plot")
The plot tag appears at the top-left, and is typically used for labelling a subplot -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study",
color = "Control", size = "No. of Undergrads",
x = "Tuition Fees", y = "SAT Average Score",
tag = "A")
You can also customize the legends!
Using the labs() function to change the titles of the
legends -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study",
# changing legend labels
color = "Control", size = "No. of Undergrads"
)
Another way to change the titles is using the guides()
function -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study") +
guides(color = guide_legend(title = "Control"),
size = guide_legend(title = "No. of Undergrads"))
To hide the legend titles use element_blank() -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
x = "Tuition Fees", y = "SAT Average Score") +
theme(legend.title = element_blank())
To place the legends at the bottom, one under another, and reduce the margin -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
x = "Tuition Fees", y = "SAT Average Score") +
theme(legend.position = "bottom",
legend.box = "vertical",
legend.margin=margin())
The legend.position argument takes the values: right,
left, bottom, top, none.
Another example -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
x = "Tuition Fees", y = "SAT Average Score") +
theme(legend.position = "right",
legend.box = "horizontal",
legend.margin=margin())
To justify the contents of a legend’s box use
legend.box.just argument -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study",
color = "Control", size = "No. of Undergrads",
x = "Tuition Fees", y = "SAT Average Score") +
theme(legend.position = "bottom",
legend.box = "vertical",
legend.margin = margin(),
legend.box.just = "left")
To hide the legend -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
x = "Tuition Fees", y = "SAT Average Score") +
theme(legend.position = "none")
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
x = "Tuition Fees", y = "SAT Average Score") +
theme(legend.position = "right",
legend.box = "horizontal",
legend.margin=margin())
Using the guides() function, you can set the
order in which the legends are shown.
For example -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
x = "Tuition Fees", y = "SAT Average Score") +
guides(colour = guide_legend(order = 1),
size = guide_legend(order = 2))
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(title = "SAT Average score VS Tuition Fee",
x = "Tuition Fees", y = "SAT Average Score") +
guides(colour = guide_legend(order = 2),
size = guide_legend(order = 1))
The element_rect() function with the fill argument
can change the colors of different parts of the plot -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(x = "Tuition Fees", y = "SAT Average Score",
title = "SAT Average score VS Tuition Fee",
subtitle = "A comparison study",
caption = "Source: U.S. Department of Education",
tag = "A"
) +
theme(plot.caption.position = "plot",
plot.background = element_rect(fill='#E2D784'),
panel.background = element_rect(fill = '#E5EFC1'),
legend.background = element_rect(fill = '#E2D784'),
legend.key = element_rect(fill = "#E2D784")
) +
guides(color = guide_legend(override.aes = list(alpha = 1, size = 4)),
size = guide_legend(override.aes = list(alpha = 1)))
Use element_blank() on panel.background to remove the background
color; the default white grid lines then become invisible -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(x = "Tuition Fees", y = "SAT Average Score",
title = "SAT Average score VS Tuition Fee")+
theme(panel.background = element_blank())
Showing the major grid lines of both axes in a single color using
panel.grid.major -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(x = "Tuition Fees", y = "SAT Average Score",
title = "SAT Average score VS Tuition Fee")+
theme(panel.background = element_blank(),
panel.grid.major = element_line(color = "grey"))
Showing only the X axis grid using panel.grid.major.x -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(x = "Tuition Fees", y = "SAT Average Score",
title = "SAT Average score VS Tuition Fee")+
theme(panel.background = element_blank(),
panel.grid.major.x = element_line(color = "grey"))
Similarly Y axis grid -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(x = "Tuition Fees", y = "SAT Average Score",
title = "SAT Average score VS Tuition Fee")+
theme(panel.background = element_blank(),
panel.grid.major.y = element_line(color = "grey"))
To hide grids use element_blank() -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(x = "Tuition Fees", y = "SAT Average Score",
title = "SAT Average score VS Tuition Fee")+
theme(panel.background = element_blank(),
panel.grid.major = element_blank())
List of themes available in ggplot:
* theme_bw()
* theme_minimal()
* theme_linedraw()
* theme_light()
* theme_dark()
* theme_classic()
* theme_void()
* theme_test()
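Since a theme is just one more component added with +, you can store the plot once and try several themes on it without retyping the whole call. A minimal sketch on a hypothetical four-row data frame named demo; p_base and themed are names chosen here for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for the college data set.
demo <- data.frame(
  tuition = c(8000, 12000, 35000, 48000),
  sat_avg = c(980, 1050, 1200, 1350)
)

p_base <- ggplot(data = demo) +
  geom_point(mapping = aes(x = tuition, y = sat_avg)) +
  labs(x = "Tuition Fees", y = "SAT Average Score")

# Adding a theme returns a new plot; p_base itself is unchanged.
themed <- list(
  classic = p_base + theme_classic(),
  minimal = p_base + theme_minimal(),
  dark    = p_base + theme_dark()
)
themed$dark
```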
Using classic theme -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(x = "Tuition Fees", y = "SAT Average Score",
title = "SAT Average score VS Tuition Fee") +
theme_classic()
Using minimal theme -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(x = "Tuition Fees", y = "SAT Average Score",
title = "SAT Average score VS Tuition Fee") +
theme_minimal()
Using dark theme -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(x = "Tuition Fees", y = "SAT Average Score",
title = "SAT Average score VS Tuition Fee") +
theme_dark()
More themes can be found in the package ggthemes. Load
the package -
library(ggthemes)
Details on the themes can be found here.
Using the theme solarized -
ggplot(data = college) +
geom_point(mapping = aes(x = tuition, y = sat_avg,
color = control, size = undergrads),
alpha = 0.35) +
labs(x = "Tuition Fees", y = "SAT Average Score",
title = "SAT Average score VS Tuition Fee") +
theme_solarized()